Exploratory data analysis example - 1C Company
This notebook presents an exploratory analysis of retail sales data, in the context of predicting future sales numbers of items at different retail outlets.
This data was provided by the Russian software publisher and retailer 1C Company, for a Kaggle competition in which the challenge is to predict monthly sales for specific products in specific shops.
The first thing to do is import a standard set of Python packages for data manipulation and analysis. Pandas and NumPy are used to store and manipulate tabular data, Matplotlib and Seaborn are used for visualization, and a few standard Python libraries are imported also.
First we'll list the files in the data directory.
DIR_PATH = "../input/competitive-data-science-predict-future-sales/"
os.listdir(DIR_PATH)
['.ipynb_checkpoints', 'items.csv', 'item_categories.csv', 'sales_train.csv', 'sample_submission.csv', 'shops.csv', 'test.csv']
We'll load all the files into the workspace as pandas dataframes with the same variable names as the csv files, except "sales_train", which is renamed to "train" to save typing. Additionally, two Russian-language text columns are replaced with English-translated versions.
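The loading step might look like the following sketch. A tiny inline sample row stands in for the real sales_train.csv so the snippet is self-contained; in the notebook each file would be read from DIR_PATH instead:

```python
import io

import pandas as pd

# Inline sample standing in for DIR_PATH + "sales_train.csv" so the
# sketch runs without the Kaggle data files.
sales_csv = io.StringIO(
    "date,date_block_num,shop_id,item_id,item_price,item_cnt_day\n"
    "02.01.2013,0,59,22154,999.00,1.0\n"
    "03.01.2013,0,25,2552,899.00,1.0\n"
)
train = pd.read_csv(sales_csv)  # "sales_train" loaded under the shorter name
```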
We take a quick look at the first few lines of each table to get a basic idea of their contents.
train.head()
| | date | date_block_num | shop_id | item_id | item_price | item_cnt_day |
|---|---|---|---|---|---|---|
| 0 | 02.01.2013 | 0 | 59 | 22154 | 999.00 | 1.0 |
| 1 | 03.01.2013 | 0 | 25 | 2552 | 899.00 | 1.0 |
| 2 | 05.01.2013 | 0 | 25 | 2552 | 899.00 | -1.0 |
| 3 | 06.01.2013 | 0 | 25 | 2554 | 1709.05 | 1.0 |
| 4 | 15.01.2013 | 0 | 25 | 2555 | 1099.00 | 1.0 |
items.head()
| | item_name | item_id | item_category_id |
|---|---|---|---|
| 0 | ! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.) D | 0 | 40 |
| 1 | !ABBYY FineReader 12 Professional Edition Full... | 1 | 76 |
| 2 | ***В ЛУЧАХ СЛАВЫ (UNV) D | 2 | 40 |
| 3 | ***ГОЛУБАЯ ВОЛНА (Univ) D | 3 | 40 |
| 4 | ***КОРОБКА (СТЕКЛО) D | 4 | 40 |
item_categories.head()
| | item_category_name | item_category_id |
|---|---|---|
| 0 | PC - Headsets / Headphones | 0 |
| 1 | Accessories - PS2 | 1 |
| 2 | Accessories - PS3 | 2 |
| 3 | Accessories - PS4 | 3 |
| 4 | Accessories - PSP | 4 |
shops.head()
| | shop_id | shop_name |
|---|---|---|
| 0 | 0 | ! Yakutsk Ordzhonikidze, 56 francs |
| 1 | 1 | ! Yakutsk TC "Central" fran |
| 2 | 2 | Adygea TC "Mega" |
| 3 | 3 | Balashikha TC "Oktyabr-Kinomir" |
| 4 | 4 | Volga TC "Volga Mall" |
It looks like the training data has been structured as normalized tables for efficiency. For convenience we can merge the training data into a single table by joining them on their shared fields.
train = train.merge(items, on='item_id', how='left')
train = train.merge(item_categories, on='item_category_id', how='left')
train = train.merge(shops, on='shop_id', how='left')
We list the datatypes of the columns to check if they could be improved.
train.dtypes
date                   object
date_block_num          int64
shop_id                 int64
item_id                 int64
item_price            float64
item_cnt_day          float64
item_name              object
item_category_id        int64
item_category_name     object
shop_name              object
dtype: object
The date field is stored as a string; we can convert it to the datetime dtype to enable extra datetime features, such as grouping by week or month.
train["date"] = pd.to_datetime(train["date"], format="%d.%m.%Y")
The other columns have appropriate datatypes but can be downcast to lower-precision types to save memory without any loss of accuracy. This is done here with a custom function that downcasts integer columns to the smallest integer type that can represent all of the column's values, and converts float columns to float32.
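A minimal sketch of such a downcasting function (the notebook's own reduce_mem_usage implementation may differ in detail, e.g. by also converting integral-valued float columns to integers):

```python
import numpy as np
import pandas as pd

def reduce_mem_usage(df):
    # Downcast integer columns to the smallest integer type that can
    # represent all values, and convert float columns to float32.
    for col in df.columns:
        if pd.api.types.is_integer_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast="integer")
        elif pd.api.types.is_float_dtype(df[col]):
            df[col] = df[col].astype(np.float32)
    return df
```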
train = reduce_mem_usage(train)
train.dtypes
date                  datetime64[ns]
date_block_num                  int8
shop_id                         int8
item_id                        int16
item_price                   float32
item_cnt_day                   int16
item_name                     object
item_category_id                int8
item_category_name            object
shop_name                     object
dtype: object
Now that the dataframe is prepared, we can check it for missing values.
train.isna().sum()
date                  0
date_block_num        0
shop_id               0
item_id               0
item_price            0
item_cnt_day          0
item_name             0
item_category_id      0
item_category_name    0
shop_name             0
dtype: int64
There aren't any, good!
The full dataset is assembled, so we can look at the first few rows to get an idea of its contents.
train.head()
| | date | date_block_num | shop_id | item_id | item_price | item_cnt_day | item_name | item_category_id | item_category_name | shop_name |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013-01-02 | 0 | 59 | 22154 | 999.000000 | 1 | ЯВЛЕНИЕ 2012 (BD) | 37 | Cinema - Blu-Ray | Yaroslavl TC" Altair " |
| 1 | 2013-01-03 | 0 | 25 | 2552 | 899.000000 | 1 | DEEP PURPLE The House Of Blue Light LP | 58 | Music - Vinyl | Moscow TEC" Atrium " |
| 2 | 2013-01-05 | 0 | 25 | 2552 | 899.000000 | -1 | DEEP PURPLE The House Of Blue Light LP | 58 | Music - Vinyl | Moscow TEC" Atrium " |
| 3 | 2013-01-06 | 0 | 25 | 2554 | 1709.050049 | 1 | DEEP PURPLE Who Do You Think We Are LP | 58 | Music - Vinyl | Moscow TEC" Atrium " |
| 4 | 2013-01-15 | 0 | 25 | 2555 | 1099.000000 | 1 | DEEP PURPLE 30 Very Best Of 2CD (Фирм.) | 56 | Music - CD of corporate production | Moscow TEC" Atrium " |
It looks like the rows are individual sales counts for specific shop-item combinations on specific days, with the following columns:

- date: the date of the sale
- date_block_num: a consecutive month index (0 = January 2013)
- shop_id, shop_name: the shop's numeric identifier and name
- item_id, item_name: the item's numeric identifier and name
- item_category_id, item_category_name: the item's category
- item_price: the price of the item on that day
- item_cnt_day: the number of units sold that day (negative values appear to be returns)
We now use the pandas "describe" method to get a summary of the distribution of values in each numerical column. We'll round values to the nearest integer for clarity and append the number of unique values for each column.
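One way to build such a summary (a sketch; the exact call used in the notebook is not shown):

```python
import pandas as pd

def summarize(df):
    # describe() rounded to whole numbers, with a "nunique" row appended
    # listing the number of unique values in each numerical column.
    desc = df.describe().round()
    nunique = df.nunique().to_frame("nunique").T[desc.columns]
    return pd.concat([desc, nunique])
```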
| | date_block_num | shop_id | item_id | item_price | item_cnt_day | item_category_id |
|---|---|---|---|---|---|---|
| count | 2935849 | 2935849 | 2935849 | 2935849 | 2935849 | 2935849 |
| mean | 14 | 33 | 10197 | 890 | 1 | 40 |
| std | 9 | 16 | 6324 | 1729 | 2 | 17 |
| min | 0 | 0 | 0 | -1 | -22 | 0 |
| 25% | 7 | 22 | 4476 | 249 | 1 | 28 |
| 50% | 14 | 31 | 9343 | 399 | 1 | 40 |
| 75% | 23 | 47 | 15684 | 999 | 1 | 55 |
| max | 33 | 59 | 22169 | 307980 | 2169 | 83 |
| nunique | 34 | 60 | 21807 | 19992 | 198 | 84 |
From this we can see:

- The data covers 34 months (date_block_num 0-33), 60 shops and around 22,000 distinct items.
- item_price has a negative minimum (-1), which must be an error.
- item_cnt_day has negative values (minimum -22), which presumably represent returns.
- The maxima of item_price (307,980) and item_cnt_day (2,169) are vastly larger than their 75th percentiles, indicating extreme outliers.
The very highest-valued entries of item_price and item_cnt_day are so few in number that they can be inspected individually. A quick inspection suggests that they are custom orders for large numbers of items, and they are probably best removed.
We remove outliers and negative values by retaining only rows with item_price and item_cnt_day values within specific thresholds.
train = train[(train["item_price"] > 0) & (train["item_price"] < 50000)]
train = train[(train["item_cnt_day"] > 0) & (train["item_cnt_day"] < 1000)]
The training data contains multiple shop_ids, not all of which are in the test set. To get an overview we can create a figure which plots total sales for each shop by month and displays the shop names in Russian and English. A table with a list of translated shop names is used for this.
A closer look at these plots finds several data cleaning issues, such as:

- pairs of shops that appear to be the same shop recorded under two different ids (shops 0 and 57, 1 and 58, 11 and 10, 40 and 39);
- shops that stopped trading during the training period and therefore do not appear in the test set.
We will merge the duplicate shops and remove all data from shops not in the test month, for simplicity.
(A similar check of item categories finds no data quality issues.)
train["shop_id"] = train["shop_id"].replace({0: 57, 1: 58, 11: 10, 40: 39})
train = train.loc[train.shop_id.isin(test["shop_id"].unique()), :]
Finally, we'll check for remaining duplicate entries in the training data.
train[train.duplicated()]
| | date | date_block_num | shop_id | item_id | item_price | item_cnt_day | item_name | item_category_id | item_category_name | shop_name |
|---|---|---|---|---|---|---|---|---|---|---|
| 1435367 | 2014-02-23 | 13 | 50 | 3423 | 999.0 | 1 | Far Cry 3 (Classics) [Xbox 360, русская версия] | 23 | Games - XBOX 360 | Tyumen SEC "Goodwin" |
| 1496766 | 2014-03-23 | 14 | 21 | 3423 | 999.0 | 1 | Far Cry 3 (Classics) [Xbox 360, русская версия] | 23 | Games - XBOX 360 | Moscow MTRC" Afi Mall " |
| 1671873 | 2014-05-01 | 16 | 50 | 3423 | 999.0 | 1 | Far Cry 3 (Classics) [Xbox 360, русская версия] | 23 | Games - XBOX 360 | Tyumen SEC "Goodwin" |
| 1866340 | 2014-07-12 | 18 | 25 | 3423 | 999.0 | 1 | Far Cry 3 (Classics) [Xbox 360, русская версия] | 23 | Games - XBOX 360 | Moscow TEC" Atrium " |
| 2198566 | 2014-12-31 | 23 | 42 | 21619 | 499.0 | 1 | ЧЕЛОВЕК ДОЖДЯ (BD) | 37 | Cinema - Blu-Ray | St. Petersburg Nevsky Center shopping center |
There are only 5 duplicate entries, but the fact that 4 of them are for the same product suggests that they are errors, so we might as well drop them.
train = train.drop_duplicates()
We can look at the distributions of the target and other features, and look for interesting relationships between features. The aim of this exploration is to find patterns in the data that could help predict the target value, and to identify the types of model appropriate for the prediction task.
The challenge is to predict monthly sales totals, so we should create a model training set that reproduces the test format by summing sales for each month.
The format of the test items is a list of all possible combinations of shops and items for shops and items that recorded at least one sale in the test month, i.e. the Cartesian product of these shops and items. We recreate this by summing the sales for the Cartesian product of active shops and items sold in each month.
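A simplified sketch of what such a create_testlike_train helper could do (the notebook's version also computes an item_revenue_month column, omitted here):

```python
from itertools import product

import pandas as pd

def create_testlike_train(train):
    # For each month, build the Cartesian product of the shops and items
    # active in that month, then attach summed monthly sales, filling
    # combinations with no recorded sales with zero.
    grids = []
    for block, month in train.groupby("date_block_num"):
        pairs = list(product(month["shop_id"].unique(), month["item_id"].unique()))
        grid = pd.DataFrame(pairs, columns=["shop_id", "item_id"])
        grid["date_block_num"] = block
        grids.append(grid)
    df = pd.concat(grids, ignore_index=True)
    monthly = (
        train.groupby(["date_block_num", "shop_id", "item_id"])["item_cnt_day"]
        .sum()
        .rename("item_cnt_month")
        .reset_index()
    )
    df = df.merge(monthly, on=["date_block_num", "shop_id", "item_id"], how="left")
    df["item_cnt_month"] = df["item_cnt_month"].fillna(0)
    return df
```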
As before, we merge the items, categories and shops tables so these can be used as predictive features.
df = create_testlike_train(train)
df = df.merge(items, on="item_id", how="left")
df = df.merge(item_categories, on="item_category_id", how="left")
df = df.merge(shops, on="shop_id", how="left")
We show the head of the table to check it looks ok.
df.head()
| | date_block_num | shop_id | item_id | item_cnt_month | item_revenue_month | item_name | item_category_id | item_category_name | shop_name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 59 | 22154 | 1.0 | 999.0 | ЯВЛЕНИЕ 2012 (BD) | 37 | Cinema - Blu-Ray | Yaroslavl TC" Altair " |
| 1 | 0 | 59 | 2552 | 0.0 | 0.0 | DEEP PURPLE The House Of Blue Light LP | 58 | Music - Vinyl | Yaroslavl TC" Altair " |
| 2 | 0 | 59 | 2554 | 0.0 | 0.0 | DEEP PURPLE Who Do You Think We Are LP | 58 | Music - Vinyl | Yaroslavl TC" Altair " |
| 3 | 0 | 59 | 2555 | 0.0 | 0.0 | DEEP PURPLE 30 Very Best Of 2CD (Фирм.) | 56 | Music - CD of corporate production | Yaroslavl TC" Altair " |
| 4 | 0 | 59 | 2564 | 0.0 | 0.0 | DEEP PURPLE Perihelion: Live In Concert DVD (К... | 59 | Music - Music video | Yaroslavl TC" Altair " |
We plot an initial histogram of the target item_cnt_month feature, with a smoothed distribution estimate.
The distribution clearly has a very large peak close to zero. Creating sales counts for all possible combinations of shop and item each month is likely to produce many entries with zero sales, so we should check what proportion of item counts are now zero.
print(f"Proportion of 0-valued targets is {df[df.item_cnt_month==0].shape[0]/df.shape[0]}.")
Proportion of 0-valued targets is 0.8457804166978585.
About 85% of target values are zero, which will clearly dominate any distribution plots. To get an idea of the distribution of non-zero values, we plot the distribution again with zero values removed.
print(f"Proportion of targets greater than 1 is {df[df.item_cnt_month>1].shape[0]/df.shape[0]}.")
Proportion of targets greater than 1 is 0.052576717093067826.
Even with zeros removed, the target is very bottom-heavy, with only around 5% of values above 1, although a small number of items sell much more than this.
The skewness of the distribution makes linear models unsuitable for predicting future sales of items, as assumptions for linear models will not be met. Instead, non-linear models such as decision trees or k-nearest neighbor models would be more suitable.
We can also plot the distribution of item prices. For this, we take the mean price of each item over the months in which it was sold.
Prices are also highly skewed towards zero. We can display the distribution more clearly by using a log scale.
Here we can see that the price distribution is approximately lognormal with a peak slightly below 10.
Finally, we can also plot the joint distribution of monthly items sales and mean prices, again using log scales for clarity.
This reveals no strong overall associations between price and sales, although associations may exist in subgroups of the data.
Outliers are also apparent.
Plotting total sales counts per month shows clear downwards and seasonal trends. Mean sales per item, however, shows a less pronounced downward trend.
Mean sales per item can also be decomposed into seasonal and continuous trends using the statsmodels package. This shows a clear yearly seasonal pattern (particularly a peak around the winter holidays) and an overall downward trend that is plausibly related to the rise of internet and digital-only sales.
The overall sales trend is clearly downwards, but there are differences at the item category level. Compare the following trend and seasonal decomposition plots for games for the PS3 and PS4, which show falling and rising trends.
In any case, including a seasonal variable (i.e. a month feature) should help the prediction model capture seasonal trends.
Each item is assigned one of 80+ categories which identify what kind of product it is and what format it is for. Information about these categories could be predictive because different types of item are likely to sell in different amounts.
First we plot mean sales and revenue per item in each category across all shops for the last year of sales data and sort by descending magnitude.
Most obviously, this plot shows that items in one particular category - "Gifts - bags, albums, mousepads" - have a much higher sales volume than items in other categories, but comparatively little revenue. This category contains few items and consists of low-cost items such as promotional bags and mousemats.
Economic importance may be better represented by plotting summed rather than mean values in each category. Doing so shows that movies and games are the highest-selling categories overall, with PS4 games accounting for the most revenue. As well as being predictive in itself, this information could be useful for deciding what categories to prioritize when building predictive models or allocating promotional resources.
Summed sales can similarly be plotted when data is grouped according to shop.
As with categories, this shows that a small number of shops, particularly those in Moscow, sell far more items overall than the others. All else being equal, an item sold in these shops can be predicted to sell in greater quantities.
Some shops sell more than others, and items in some item categories sell more than others, but are there differences in the relative quantities of items in each category that shops sell?
Each shop's profile of summed sales per category can be computed; it is plotted here as a heatmap.
The vertical stripes in the heatmap indicate shops that differ from the mean in some way, but high-dimensional data like this can be difficult to understand without some kind of summary.
Principal component analysis (PCA) lets us decompose high-dimensional data into a low-dimensional representation that makes it easier to get an overview of patterns in the data. Although the values in the shop-item category sales data do not have the normal distribution assumed by the PCA algorithm, this can be partially corrected by log transforming the data.
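The decomposition might be sketched as below, with random counts standing in for the real shop x item_category sales matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
profiles = rng.poisson(20, size=(42, 84)).astype(float)  # stand-in data

X = np.log1p(profiles)         # log transform to reduce skew
pca = PCA(n_components=6)      # PCA centres the data internally
scores = pca.fit_transform(X)  # one row of component scores per shop
```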
Below are the results of the PCA-decomposition of the shop-item_category sales profiles using the implementation in scikit-learn. The top-left plot shows the proportion of the overall variance explained by each component, while the remaining 3 plots show the shops plotted on scatter plots by their scores on the first 6 principal components.
The explained variance plot shows that around 75% of the differences between the shops can be explained by two linear components.
Plotting the shops according to their scores on the first two components shows that the shop with shop_id 55 is particularly unlike the other shops, while plotting the shops according to their other components reveals that shop 12 is also an outlier. This makes intuitive sense, as these are internet or virtual shops rather than physical stores.
Insight into what these components correspond to can be gained by plotting the mappings that relate the component scores to the original data. We do this here for the two largest components, and sort the component mappings for clarity.
Looking at component 1, we see that this is mostly highly weighted on "digital" categories, i.e. non-physical online downloads, which fits with the internet store 55 having a very high score in this dimension. Further components reveal other dimensions in the differences between shops, such as the sales of music and audiobook products being correlated.
The age of items when they are sold can be approximately calculated by subtracting from the sale date the first date or month on which they were sold.
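In terms of month indices, the age feature can be sketched as:

```python
import pandas as pd

df = pd.DataFrame({
    "item_id":        [1, 1, 1, 2, 2],
    "date_block_num": [0, 1, 3, 2, 3],
})
# Age in months = current month index minus the first month the item appears.
first_month = df.groupby("item_id")["date_block_num"].transform("min")
df["item_age"] = df["date_block_num"] - first_month
```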
Total monthly item sales as a function of item age is plotted below for all items, and separately for items in two representative categories.
Plotting total monthly item sales as a function of their ages shows that items tend to sell most when they are new and then decline to a plateau about 1 year later. The slightly lower sales in the first compared to the second month is attributable to items not always being available for the whole first month.
It is also evident that this tendency for items to sell most shortly after their release is stronger for some categories, such as movies, than for others.
Even when taking the decline of sales volume over time into account, it seems likely that products that sell well in one month are likely to also sell well in the following month. A column can be created which contains the sales figures from the previous month for the same shop-item combination.
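A minimal sketch of such a lag column, built by self-joining the monthly frame on a shifted month index (the notebook's own helper may differ):

```python
import pandas as pd

def add_lag(df, col="item_cnt_month", lag=1):
    # Join each row to the same shop-item combination `lag` months earlier;
    # combinations with no earlier row get NaN.
    lagged = df[["date_block_num", "shop_id", "item_id", col]].copy()
    lagged["date_block_num"] += lag
    lagged = lagged.rename(columns={col: f"{col}_lag_{lag}"})
    return df.merge(lagged, on=["date_block_num", "shop_id", "item_id"], how="left")
```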
We can create a regression plot of sales counts as a function of the previous month's sales, for a sample of items that are at least a month old. We use log scales on the axes for clarity and plot the estimate of the central tendency (mean) of item_cnt_month.
Adding feature item_cnt_month_lag_1
Single shop-item sales in individual months are mostly low-valued and tend to be noisy. To reduce this noise, windowing methods can be used to calculate a historical mean of sales as a weighted sum of the sales from multiple previous months. An example showing multiple types of window is shown below.
df = add_rolling_stats(df, ["shop_id", "item_id"], kind="ewm", window=1)
df = add_rolling_stats(df, ["shop_id", "item_id"], window=12)
Creating feature "shop_id_item_id_item_cnt_month_mean_ewm_hl_1"
Creating feature "shop_id_item_id_item_cnt_month_mean_rolling_mean_win_12"
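The exponentially weighted variant can be sketched directly in pandas: a group-wise ewm mean of past months, shifted by one so each row only sees earlier months. This is a guess at what add_rolling_stats computes; the notebook's implementation is not shown:

```python
import pandas as pd

df = pd.DataFrame({
    "shop_id": [1, 1, 1],
    "item_id": [7, 7, 7],
    "item_cnt_month": [4.0, 2.0, 6.0],
})
# shift(1) excludes the current month; halflife=1 matches the "hl_1" suffix.
df["ewm_hl_1"] = (
    df.groupby(["shop_id", "item_id"])["item_cnt_month"]
      .transform(lambda s: s.shift(1).ewm(halflife=1).mean())
)
```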
df = create_apply_ME(df, ["item_id"], target="item_cnt_month")
Adding feature item_id_item_cnt_month_mean_lag_1
Historical sales figures are obviously not directly available for items in their first month of availability, so information about sales of similar items must be used instead. Ideally, multiple measures of similarity should be found and tested.
The item_category_id field can be used to group similar items, and the shop_id field can be used to find items sold in the same shop. These can be combined with the item_age engineered feature to calculate mean first-month sales counts for products grouped by item category or shop. This is plotted below for the last available year of sales data.
As before for items of all ages, the mean new item sales count can also be calculated for each item category - shop combination, plotted below as a heat map.
In addition to the information contained in the item categories and shop identities, all items have an associated item_name text feature, which contains a short description of the item that often includes details such as its title, format (e.g. PS4 or PS3) and language. Information can potentially be extracted from this text string to group similar items.
To aid extraction of information from text it is useful to clean it: removing irrelevant special characters and punctuation, excess blank spaces, low-information stopwords such as "the", and diacritics such as accents over letters, and converting all text to lowercase. Much of this can be performed with regular expression operations, as demonstrated here:
import re

from nltk.corpus import stopwords

# NLTK stopword lists (may require a one-off nltk.download("stopwords"))
all_stopwords = stopwords.words("russian") + stopwords.words("english")

def clean_text(string):
    # Remove punctuation and special characters, collapse repeated
    # whitespace, lowercase, and drop stopwords.
    string = re.sub(r"[^\w\s]", "", string)
    string = re.sub(r"\s{2,}", " ", string)
    tokens = string.lower().split()
    tokens = [t for t in tokens if t not in all_stopwords]
    return " ".join(tokens)

items["item_name_clean"] = items["item_name"].apply(clean_text)
items.loc[1000:, ["item_name", "item_name_clean"]].head()
| | item_name | item_name_clean |
|---|---|---|
| 1000 | 3D Action Puzzle "Зомби" Уборщик | 3d action puzzle зомби уборщик |
| 1001 | 3D Action Puzzle "Зомби" Шахтер | 3d action puzzle зомби шахтер |
| 1002 | 3D Action Puzzle "Техника" Бомбардировщик | 3d action puzzle техника бомбардировщик |
| 1003 | 3D Action Puzzle "Техника" Вертолет | 3d action puzzle техника вертолет |
| 1004 | 3D Action Puzzle "Техника" Гоночная машинка | 3d action puzzle техника гоночная машинка |
One way that the item_name field could be used is by extracting individual words or n-grams (groups of sequential words) and treating them as individual binary features. Doing this creates a very large number of features, so some kind of filtering is necessary, such as specifying a minimum number of occurrences for a word feature, or setting a threshold on a measure of relevance to the target feature, such as correlation.
Below is an example of the 1- and 2-grams produced from a single item_name text string.
5 ngrams found in all items
| | item_name | ng: jethro | ng: jethro tull | ng: lp | ng: tull | ng: tull lp |
|---|---|---|---|---|---|---|
| 4000 | JETHRO TULL This Was LP | 1 | 1 | 1 | 1 | 1 |
Items with different item_ids are often related to each other, such as being different versions of the same video game or movie, and so are likely to have related sales figures. This can be taken advantage of by grouping similar items together based on their item_names.
The Python package TheFuzz implements fuzzy string matching to measure the similarity of sequences of characters and words. This similarity measure can be used to group related items into a single category. The resulting "item name group" feature can then be used like any other categorical feature.
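A minimal grouping sketch; the standard library's difflib stands in for TheFuzz here (TheFuzz's ratio scores are scaled 0-100 rather than 0-1, but the idea is the same):

```python
from difflib import SequenceMatcher

def name_groups(names, threshold=0.8):
    # Assign each name the group id of the first earlier name it closely
    # matches; otherwise start a new group. This is O(n^2) in the number
    # of names, so a real implementation would need some blocking trick.
    group_ids, reps = [], []  # reps: one representative name per group
    for name in names:
        for gid, rep in enumerate(reps):
            if SequenceMatcher(None, name, rep).ratio() >= threshold:
                group_ids.append(gid)
                break
        else:
            reps.append(name)
            group_ids.append(len(reps) - 1)
    return group_ids
```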
Below is a short extract of the list of items with their associated name group category designations, showing related items assigned to the same category value.
| | item_name | item_name_group |
|---|---|---|
| 3566 | Fuse [Xbox 360, английская версия] | 1362 |
| 3567 | G Data Internet Security 2013 (1ПК / 1 год) (G... | 1363 |
| 3568 | G Data Internet Security 2013 (3ПК / 1 год) (G... | 1363 |
| 3569 | GABIN The Best Of Gabin 2CD | 1364 |
| 3570 | GABRIEL PETER New Blood Live In London Blu-Ray | 1365 |
| 3571 | GAHAN DAVE & SOULSAVERS Angels & Ghosts | 1366 |
| 3572 | GAHAN DAVE & SOULSAVERS Angels & Ghosts (фирм.) | 1366 |
| 3573 | GALLIANO, RICHARD Piazzola Forever Septet DVD ... | 1367 |
| 3574 | GARBAGE Not Your Kind Of People | 1368 |
| 3575 | GARBAGE Absolute Garbage DVD | 1369 |
Categorical features such as name groups and shop ids carry useful predictive information, but the relationship between category values and the target variable is not consistent over the training data: items, item categories and shops increase or (more often) decline in popularity over time, as can be seen when plotting mean sales of the music item categories.
Theoretically, a model could learn the interactions between individual category values and time periods, but for categorical features with high numbers of values, such as item identities and name groups, this could require very complex models and cause problems with overfitting.
To spare the predictive model from having to learn the individual relationships between categories and specific time periods, a useful solution is to turn categorical features into numerical features by re-encoding each value of the category as the mean of the target variable for training items in the same category, over some time window. As with individual items, different time windows can be used.
Below are example windowed mean encodings for 3 values of the item category name feature using 3 different temporal windows.
df = add_rolling_stats(df, ["item_category_name"], kind="ewm", window=1)
df = add_rolling_stats(df, ["item_category_name"], window=12)
df = add_rolling_stats(df, ["item_category_name"], window=1)
Creating feature "item_category_name_item_cnt_month_mean_ewm_hl_1"
Creating feature "item_category_name_item_cnt_month_mean_rolling_mean_win_12"
Creating feature "item_category_name_item_cnt_month_mean_rolling_mean_win_1"
The next step after exploratory data analysis should be the creation of a predictive model. Predictive models are obviously useful for planning actions, but a fitted model can also be used for further data exploration by extracting how the model relates features to the final prediction.
Here we use the powerful model explanation package SHAP (SHapley Additive exPlanations) to gain additional insights about predictors of item sales. The model explained by SHAP is a LightGBM regressor, a gradient boosted decision tree model.
First a pre-prepared training frame is loaded and used to train the LightGBM regressor.
import lightgbm as lgbm
warnings.filterwarnings("ignore", module="lightgbm")
model = lgbm.LGBMRegressor(**params)
categoricals = [
"item_category_id",
"month",
]
# Fit the booster using early stopping
eval_set = [(X_train, y_train), (X_valid, y_valid)]
early_stopping_rounds = 30
model.fit(
X_train,
y_train,
eval_set=eval_set,
eval_metric=["rmse"],
verbose=100,
categorical_feature=categoricals,
early_stopping_rounds=early_stopping_rounds,
)
Training until validation scores don't improve for 30 rounds
[100]	training's rmse: 0.856163	training's l2: 0.733015	valid_1's rmse: 0.80707	valid_1's l2: 0.651363
[200]	training's rmse: 0.752276	training's l2: 0.565919	valid_1's rmse: 0.754368	valid_1's l2: 0.569071
[300]	training's rmse: 0.711774	training's l2: 0.506623	valid_1's rmse: 0.744477	valid_1's l2: 0.554246
[400]	training's rmse: 0.687697	training's l2: 0.472927	valid_1's rmse: 0.742502	valid_1's l2: 0.551309
[500]	training's rmse: 0.671048	training's l2: 0.450306	valid_1's rmse: 0.741354	valid_1's l2: 0.549606
Early stopping, best iteration is:
[561]	training's rmse: 0.662931	training's l2: 0.439477	valid_1's rmse: 0.740871	valid_1's l2: 0.54889
LGBMRegressor(cat_smooth=45.01680827234465, colsample_bytree=0.8,
learning_rate=0.01, max_bin=214, min_child_samples=27,
min_child_weight=0.021144950289224463, min_data_in_bin=7,
n_estimators=8000, num_leaves=966, subsample=0.6,
subsample_for_bin=300000, subsample_freq=5)
A SHAP explainer object is attached to the fitted model and used to generate SHAP values, which quantify the effect of each of the features on each prediction.
import shap
warnings.filterwarnings("ignore", module="shap")
n = 10000  # Takes around 1.5 hours for 10000 rows
sample = X_train.sample(n)
explainer = shap.Explainer(model)
shap_values = explainer(sample)
shap.initjs()
The beeswarm plots below show a summary of the effect of each feature on each prediction, and rank the features by overall importance. Each prediction is represented by a point next to each feature name, with the position of the point on the left-right axis indicating the increase or decrease over the baseline prediction caused by the feature value, and the color of the point indicating the value of the feature. For example, a red marker towards the right-hand side of the plot indicates that the feature had a high value and caused an increase in the prediction.
Here, the top feature is `shop_id_item_id_item_cnt_month_mean_ewm_hl_1`, which is the exponential moving average of previous sales for the same shop-item combination, with the expected relationship of higher previous sales predicting higher future sales. The fact that the model chose this as the most predictive feature confirms the predictive power of the moving average.
The second most predictive feature, `shop_id_item_category_id_new_item_item_cnt_month_mean_rolling_mean_win_12` (the name is automatically generated), is a feature that contains the 12-month mean of sales of items with the same category in the same shop, restricted to either new (i.e. first-month) items or non-new items depending on whether the item to be predicted is new or not. This is likely to be particularly valuable for predicting sales of new items, a hypothesis which can be explored with further visualizations.
shap.plots.beeswarm(shap_values, max_display=50)
A clearer view of the relationships between the target feature and one or more predictors can be gained with dependence plots.
The following plot shows the relationship between the exponential moving average shop-item sales feature and predictions, showing an approximately linear relationship with increasing variance as feature values increase.
A dependence plot can also vizualize interactions between two predictor features by varying point color according to a second variable. Below is plotted the shop-item_category-new_item 12 month windowed mean, with new items coloured red and non-new items coloured blue. We hypothesized that this feature would be particularly valuable for new items, which is confirmed by the steeper slope of the line that can be traced through the red points in the graph.
SHAP can also automatically determine which other feature had the greatest interaction with a given feature when generating predictions. The dependence plot below shows the main effect of the `first_item_sale_days` feature, which gives the number of days between the first recorded sale of an item and the first day of the prediction month. This feature predicts higher sales for newer items, with the effect decreasing over time. SHAP identifies `shop_cluster_item_id_item_cnt_month_mean_lag_1`, a feature containing the mean sales of the same item in similar shops, as the main interaction feature, plotting higher values of this feature in red. In the plot, red points tend to sit above blue points at low values of the x-axis feature and below them at higher values, indicating that the main trend of decreasing sales with item age is more pronounced for high-selling items.